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SCALABLE ALGORITHMS FOR REGULARIZED PRECISION 
MATRICES VIA STOCHASTIC OPTIMIZATION 

YVES F. ATCHADE, RAHUL MAZUMDER, AND JIE CHEN 


Abstract. We consider the problem of computing a positive definite pxp inverse covari¬ 
ance matrix aka precision matrix 6 = (Oij) which optimizes a regularized Gaussian max¬ 
imum likelihood problem, with the elastic-net regularizer \{a\6ij I + 

with regularization parameters a € [0,1] and A > 0. The associated convex semidefinite 
optimization problem is notoriously difficult to scale to large problems and has demanded 
significant attention over the past several years. We propose a new algorithmic frame¬ 
work based on stochastic proximal optimization (on the primal problem) that can be used 
to obtain near optimal solutions with substantial computational savings over determin¬ 
istic algorithms. A key challenge of our work stems from the fact that the optimization 
problem being investigated does not satisfy the usual assumptions required by stochastic 
gradient methods. Our proposal has (a) computational guarantees and (b) scales well 
to large problems, even if the solution is not too sparse; thereby, enhancing the scope of 
regularized maximum likelihood problems to many large-scale problems of contemporary 
interest. An important aspect of our proposal is to bypass the deterministic computation 
of a matrix inverse by drawing random samples from a suitable multivariate Gaussian 
distribution. 


1. Introduction 


We consider the problem of estimating an inverse covariance matrix aka precision ma¬ 


trix (Lauritzen, 1996) 9, from a data matrix X^xp comprised of n samples from ap dimen¬ 
sional multivariate Gaussian distribution with mean zero and covariance matrix S = 
i.e., Xj N(0, S) for z = 1,..., n. If n < p it is a well known fact that the Maximum 
Likelihood Estimate (MLE) does not exist, and even if it does exist (n > p) the MLE can 
be poorly behaved and regularization is often called for. Various forms of regularization 
are used to improve the statistical behavior of covariance matrix estimates (jPourahmadi 


2013 Biihlmann and Van De Geer, 2011; Hastie et al. 2009) and is a topic of significant 
interest in the statistics and machine learning communities. This paper deals with the 
problem of computing such regularized matrices, in the settings where p is much larger 
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than n or both p and n are large. To motivate the reader, we briefly review two popular 
forms of precision matrix regularization schemes under a likelihood framework: sparse 
precision matrix estimation via ^i-norm regularization, and its dense counterpart, using 
an .^ 2 -iiorm regularization (ridge penalty); both on the entries of the matrix 9. 


Sparse precision matrix estimation — the Graphical Lasso. One of the most popular 
regularization approaches and the main motivation behind this paper is the Graphical 
Lasso (Yuan and Lin, 2007 Banerjee et ah, 2008 Friedman et al. 2007b|) procedure aka 


Glasso. Here, we estimate 6 under the assumption that it is sparse, with a few number 
of non-zeros. Under the multivariate Gaussian modeling set up, 6ij = 0 (for i ^ j) is 
equivalent to the conditional independence of Xi and xj given the remaining variables, 
where, x = (xi,..., Xp) ~ N(0, S). Glasso minimizes the negative log-likelihood subject 
to a penalty on the ii norm of the entries of the precision matrix 9. This leads to the 
following convex optimization problem ( Boyd and Vandenbergh^ 2004); 

-logdet^-h Tr(0S')-hAy^ (1) 

V__ y < ^ 


minimize 






where, S = ^ sample covariance matrix, A4+ denotes the set of positive 

definite matrices and A > 0 is a tuning parameter that controls the degree of regulariza- 
tioi|^ In passing, we note that the Glasso criterion, though motivated as a regularized 
negative log-likelihood problem, can be used more generally for any positive semidefinite 
(PSD) matrix S. 

In modern statistical applications we frequently encounter examples where Problem 0 
needs to be solved for p of the order of several thousands. Thus there is an urgent need to 
develop fast and scalable algorithms for Problem ©• In this vein, the past several years 
have witnessed a flurry of interesting work in developing fast and efficient solvers for Prob¬ 
lem 0. We present a very brief overview of the main approaches used for the Glasso 
problem, with further additional details presented in the Appendix, Section A represen¬ 
tative list of popular algorithmic approaches include (a) block (where, each row/column 
is a block) coordinate methods (Banerjee et ah, 2008[ Friedman et ah, 2007a[ Mazumder 


and Hastie , |2012b ); (b) proximal gradient descent type methods ( [Banerjee et ah ~ [2008 
Lu| 2009] ~ Rolfs et ah, 2012[); (c) methods based on Alternating Direction Method of 


Multipliers (Scheinberg et ah, 2010| Boyd et ah, 2011; Yuan, 2012); (d) specialized inte¬ 


rior 


point methods (Li and Toh[ [2010 ); and (e) proximal Newton type methods (Hsieh 


et ah 2014 Oztoprak et ah, 2012). All the aforementioned methods are deterministic 
in nature. Precise (global) computational guarantees are available for some of them. It 
appears that most of the aforementioned computational approaches for Problem ([^ , have 
a (worst-case) cost of at least 0{p^) or possibly larger—this is perhaps not surprising, 
since for A = 0, finding the MLE requires computing S~^ (assuming that th e inverse ex- 
ists), with cost 0{p^). Many of the state-of-the art algorithms for Glasso (Hsieh et ah 


2014, Friedman et ah, 2007a) (for example) make clever use of the fact that solutions to 


^As long as A > 0, the minimum of Problem Q is finite (see Lemma[^ and there is a unique minimizer. 
In some variants of Problem 0. the diagonal entries of 0 are not penalized—such an estimator can infact 
be written as a version of Problem Q, with S ■<— S — AIpxp where, I is a p x p identity matrix. The 
minimum of this problem need not be finite. In this paper, however, we will consider formulation 0 
where the diagonals are penalized. 
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Problem 0 are sparse, for large values of A. Another important structural property of 
Glasso, that enables the scalable computation of Problem 0 is the exact thresholding 
property ( [Mazumder and Hastie , p012a| Witten et ah, 2011[ ). The method is particularly 
useful for large values of A, whenever the solution to the Glasso problem decomposes 
into smaller connected components; and becomes less effective when the solution to the 
Glasso problem is not sufficiently sparse. In short, computing solutions to Problem © 
become increasingly difficult as soon as p exceeds a few thousand. 

All existing algorithms proposed for Glasso, to the best of our knowledge, are deter¬ 
ministic batch algorithms. To improve the computational scalability of Problem Q, we 
consider a different approach in this paper. Our approach, uses for the first time, ideas 
from stochastic convex optimization for the Glasso problem. 


From sparse to dense regularization. We consider another traditionally important regu¬ 
larization scheme, given via the following optimization problem: 

minimize — logdet 9 + Tr(6S) + — Of,, (2) 

for some value of A > 0. This can be thought of as the ridge regularized versioij^ of 
Problem Q. We will see in Section that Problem Q admits an analytic solution which 
requires computing the eigen-decomposition of S, albeit difficult when both n and p are 
large. Note that many of the tricks employed by modern solvers for Glasso, anticipating a 
sparse solution, no longer apply here. The stochastic convex optimization framework that 
we develop in this paper also applies to Problem Q, thereby enabling the computation 
of near-optimal solutions for problem-sizes where the exact solution becomes impractical 
to compute. 

In this paper, we study a general version of Problems Q and 0 by taking a convex 
combination of the ridge and (.i penalties: 


minimize 

e&M+ 


— log det 9 + Tr(0S') -|- 

'• -V-' ^ 

:=/( 0 ) ^ 


(^X\9ij\ + 


(1 - «) 
2 



■=9a{d) 




(3) 


with a G [0,1]. Following Zou and Hastie (2005), we dub the above problem as the 
elastic net regularized version of the negative log-likelihood. Notice that for a = 1 we get 
Glasso and a = 0 corresponds to Problem ([^. We propose a novel, scalable framework 
for computing near-optimal solutions to Problem ^ via techniques in stochastic convex 
optimization. 


1.1. Organization of the paper. The remainder of the paper is organized as follows. 
Section provides an outline of the methodology and our contributions in this paper. We 
study deterministic proximal gradient algorithms in Section We present the stochastic 
algorithms, proposed herein—Algorithm and Algorithm in Section We describe the 


^Note that some authors ( Warton] |2008| refer to a different problem as a ridge regression problem, 
namely one where one penalizes the trace of 0 instead of the frobenius norm of 0. Such regularizers 


1989 Hastie et al. 19951. 


are often used in the context of regularized discriminant analysis (Friedman 
However, in this paper we will denote Problem 0 as the ridge regularized version of the Gaussian maximum 
likelihood problem. 
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exact thresholding rule for Problem in Section The application of the stochastic 
algorithm (Algorithm to the ridge regularized problem (Problem is presented in 
Section We present some numerical results that illustrate our theory in Section 
The proofs are collected in Section and some additional material are presented in the 
appendix. 


2. Outline of the paper and our contributions 


Deterministic Algorithms. The starting point of our analysis, is the study of a (determin¬ 
istic) proximal gradient (Nesterov ( |2013 ); Beck and Teboulle (2009); Becker et al. (2011); 
Parikh and Boyd ( 2013[ )) algorithm (Algorithm 1) for solving Problem (3). A direct ap¬ 
plication of the proximal gradient algorithm (Nesterov (2013); Beck and Teboulle (2009), 
for example) to Problem ([^ has 


some issues. 


Firstly, the basic assumption of Lipschitz 
continuity of the gradient V/(0), demanded by the proximal gradient algorithm, is not 
satisfied here. Secondly, the proximal operator associated with Problem ([^ is difficult 
to compute, as it involves minimizing an ii regularized quadratic function over the cone 
Ai+. We show that these hurdles may be overcome by controlling the step-size. Loosely 
speaking, we also establish that V/(0) satisfies a Lipschitz condition (and f{6) satisfies a 
strong convexity condition) across the iterations of the algorithm—a notion that we make 
precise in Section Using these key aspects of our algorithm, we derive a global linear 
convergence rate of Algorithm 1, even though the objective function is not strongly convex 
on the whole feasible set A4+. Furthermore, the algorithm has an appealing convergence 
behavior that we highlight: its convergence rate is dictated by the condition numbei|^ 
of 0, a solution to Problem (§• For a given accuracy e > 0, our analysis implies that 
Algorithm has a computational cost complexity of O ^p^cond(0)^ log(e“^)^ to reach a 

e-accurate solution, where cond(0) is the condition number of 0. The computational bot¬ 
tleneck of the algorithm is the evaluation of the gradient of the smooth component at 
every iteration, which in this problem is Vf{9) = —6~^ + S. Computing the gradient 
requires performing a matrix inversion, an operation that scales with p as 0{p ^)—we refer 
the reader to Figure for an idea about the scalability behavior of direct dense matrix 
inversion for a p x p matrix, for different sizes of p. 

Proximal gradient descent methods on the primal of the Glasso problem has been 
studied by Rolfs et al. (2012). Our approaches however, have some differences—our anal¬ 


ysis hinges heavily on basic tools and techniques made available by the general theory of 
proximal methods; and we analyze a generalized version: Problem (§. The main motiva¬ 
tion behind our analysis of Algorithm 1 is that it lays the foundation for the stochastic 
algorithms, our primary object of study in this paper. 


Stochastic Algorithms. For large values of p (larger than a few thousand). Algorithm 1 
slows down considerably, due to repeated computation of the inverse: 9~^ (See also Fig¬ 
ure Q across the proximal gradient iterations. Even if the matrix 9 is sparse and sparse 
numerical linear algebra methods are used for computing 9~^, the computational cost de¬ 
pends quite heavily upon the sparsity pattern of 9 and the re-ordering algorithm used to 


^defined as the ratio of the largest eigenvalue over the smallest eigenvalue 
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reduce fill-ins; and need not be robusl|^ across different problem instances. Thus, our key 
strategy in the paper is to develop a stochastic method that completely bypasses the exact 
computation (via direct matrix inversion) of the gradient Vf{9) = S — 9~^. We propose 
to draw samples zi,..., (a-t iteration k) from N(0, 9'^\) to form a noisy estimate 

S — Ylk=i gradient S — 9'j^^-^^. This scheme forms the main workhorse of 

our stochastic proximal gradient algorithm, which we call Algorithmic 


(Zoomed) 




P 


P 


Figure 1. Figure showing the times in seconds to perform a direct eigen- 
decomposition, inversion and Cholesky decomposition using dense direct nu¬ 
merical linear algebra methods, for real symmetric matrices with size of upto 
p = 65, 000. Eigen decompositions and matrix inversions are less memory friendly, 
when compared to Cholesky decompositions for large problem sizes. The tim¬ 
ings displayed in the graphs support the practical feasibility of using Cholesky 
decomposition methods for large matrices—a main workhorse for the stochastic 
optimization algorithms proposed in the paper. [Right panel] displays a zoomed 
in version of the left panel plot, showing that Cholesky decompositions are sig¬ 
nificantly faster than inversion and eigen-decomposition methods even for smaller 
problems p < 30, 000. The tail of the direct inversion curve on the left deviates 
from the 0{p^) trend because the storage requirement has exceeded the capacity 
of main memory. Thus, the extra time is consumed by the slower virtual memory 
access. [The matrices used here were sparse with proportion of non-zeros 10/p, 
positive definite with the reciprocal of the condition number given by 0.2—we 
used the Matlab function sprandsym to generate the matrices.] 


Stochastic optimization algorithms based on noisy estimates of the gradient have a 
long history that goes back to the pioneering works of 

). As datasets encountered by statisticians in the modern day grow 
larger and the optimization problems associated with statistical estimation tasks become 

^In fact, in our experiments we observed that Matlab performs dense Cholesky decomposition more 
efficiently than sparse Cholesky decomposition, even when the matrix is sparse. This is due in part 
to multithreading: the dense Cholesky decomposition is automatically multithreaded in Matlab, but 
the sparse Cholesky decomposition is not. Another reason is the difficulty of finding a good re-ordering 
algorithm to limit fill-ins when performing sparse Cholesky decomposition. 


and Wolfowitz (1952 


Robbins and Monro (1951); Kiefer 
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increasingly challenging, the importance of stochastic algorithms to deliver scalable solvers 
is being progressively recognized in recent years. See for instance, the recent works in the 
optimization and machine learnin g communities ([Bertsekas (2011); Duchi et al. (2012); 
Shalev-Shwartz and Zhang (2013); Konecny and Richtarik (2013); Xiao and Zhang (2014); 


Atchade et al. (2014), and the references therein). We note however, that our stochastic 
optimization formulation of Problem ([s]) differs from the usual stochastic optimization 


problem (for instance as in Bertsekas (2011)) which solves problems of the form 


minimize 

e 


/((9;x)7r(dx) + ^((9), 


(4) 


for an intractable integral f f(0; x)7r(dx), where, the map 6 i—)■ f{0;x) is smooth, and 
g is possibly non-smooth. A special instance of is when vr is a discrete probability 
distribution over a very large set, making the integral f f(ff;x)7r(dx) = ^ ^ 

large sum and difficult to work with. We make the following remarks that highlight the 
differences between our approach and generic approaches for Problem (g: 

• Problem ([^ does not admit a straightforward representation of the form (Q. 

• The gradient Vf{9) = S — 9~^ has the integral representation S — f xx'TT 0 {dx), 
where vr^ is the density of N(O,0“^), which depends on 9 — a distinctive feature 
that sets our stochastic optimization framework apart from Problem (Q. 

• Last, but not least, the gradient map 9 i—?■ Vf{9) is not Lipschitz continuous on 
M+, the feasible set of Problem 

A main contribution of our paper is to address the above challenges in the context of 
the stochastic optimization framework being proposed herein. In fact, our stochastic 


optimization framework is more in sync with the Robbins-Monro algorithm (Robbins and 


Monro (1951)) and can be viewed as a large-scale and non-smooth variant of the Robbins- 


Monro algorithm, along the lines of Atchade et al. (2014). Note however, that the theory of 


Atchade et al.| (2014) cannot be directly applied here, as it requires the classical Lipschitz- 
continuity assumption of the smooth component of the objective function, and the ability 
to compute the proximal map of the non-smooth component. As explained above, these 
properties are not readily available in our case. 

The main cost of Algorithm 2 lies with generating multivariate Gaussian random vari¬ 
ables from N(0, 0“^). A given iteration of Algorithm is more cost-effective than an 
iteration of the deterministic algorithm, if the Monte Carlo sample size used in that it¬ 
eration is smaller than p. This is because the cost of approximating 9~^ using p random 
samples from N(0, 9~^) is similar to the cost of computing 9~^ by direct matrix inversion. 
We show that with an appropriate choice of the Monte Carlo batch size sequence {A^a:} 
(see Section [4.1| for details). Algorithm reaches a solution with accuracy e, before the 
Monte Carlo sample size becomes larger than p if p > cond(0)^e“^. This result implies 
that Algorithm is more cost-effective than Algorithm in finding e-accurate solutions 
in cases when p is large, the solution 9 is well-conditioned, and we seek a low-accuracy 


is then O ^p^cond(0)^ log(e ^)^. While 


im 1 which performs a direct matrix in¬ 


approximation of 9. The total cost of Algorithm 
on the surface, the cost looks similar to Algorit 
version at every iteration, the constant involved in the big-0 notation favors Algorithm 2 
—see for example. Figure [^showing the differences in computation times between a dense 
Cholesky decomposition and a direct dense matrix inversion. This is further substantiated 
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in our numerical experiments (Section where we do systematic comparisons between 
Algorithms 1 and 2. 

A deeper investigation of our stochastic optimization scheme (Algorithm outlined 
above, reveals the following. At each iteration k, all the random variables (samples) used 
to estimate are discarded, and new random variables are generated to approximate 
9'^^. We thus ask, is there a modified algorithm that makes clever use of the information 
associated with an approximate 9^^-^ to approximate 9'^^? In this vein, we propose a 
new algorithm: Algorithm which recycles previously generated samples. Algorithm 
has a per-iteration cost of 0{Np^) when a Cholesky factorization is used to generate the 
Gaussian random variables, and where N is the Monte Carlo batch-size. The behavior 
of the algorithm is more complex, and thus developing a rigorous convergence guarantee 
with associated computational guarantees analogous to Algorithmj^is beyond the scope of 
the current paper. We however, present some global convergence results on the algorithm. 
In particular, we show that when the sequence produced by Algorithm converges, it 
necessarily converges to the solution of Problem 


Dense problems. We emphasize that a sizable component of our work relies on the speed 
and efficiency of modern dense numerical linear algebra methods for scalability, and thus 
our approach is relatively agnostic to the sparsity level of 9, a solution to Problem © . In 
other words, our approach adapts to Problem ([^ for large n and p, a problem which is 
perhaps not favorable for several current specialized implementations for Problem 0 - 


Exact covariance thresholding. We also extend the exact thresholding rule (Mazumder 
and Hastie, [2012a) originally proposed for the Glasso problem, to the more general case 


of Problem (j^. Our result established herein, implies that the connected components of 
the graph (l(|sij| > Aa)) are exactly equal to the connected components of the graph 
induced by the non-zeros of 9, a solution to Problem Q. This can certainly be used as a 
wrapper around any algorithm to solve Problem Q; and leads to dramatic performance 
gains whenever the size of the largest connected component of (1 (|sjj| > Aa)) is sufficiently 
smaller than p. 

We note that developing the fastest algorithmic implementation for Problem Q or its 
special case, Glasso, is neither the intent nor focus of this paper. We view our work 
as one that proposes a new framework based on stochastic optimization that enables the 
scalable computation for the general class of Problems Q, across a wide range of the 
regularization parameters. The scalability properties of our proposal seem to be favorable 
over deterministic batch methods and in particular, proximal gradient descent methods 
tailored for Problem Q. 


2.1. Notation. Throughout the paper, the regularization parameters A and a G (0,1], 
appearing in Problem Q are assumed fixed and given. Let M. denote the set oi p x p 
symmetric matrices with inner product (A, B) = Tr(A'i?) and the Frobenius norm ||A||p 
■sj (A, A). Ad+ denotes the set of positive definite elements of M.. Let / be the function 
M. —)■ (0, oo] defined by 


f{e) 


— logdet 0-b ^(05") if 0 G AI+ 

-boo if 9 G M \ M 
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We shall write the regularization term in Problem ([^ as 

def 


9„(9) e x: ( aA|«y| + Wl) AO?, 


and 




M0) = f{0)+ga{e), 0GM. (5) 

For a matrix A ^ ||^||2 denotes the spectral norm of A, Amin(^) (respectively Amax(^)) 

denotes the smallest (respectively, the largest) eigenvalue of A, and ||A||i 
For a subset T) A4, iv denotes the indicator function of V, i.e. 




iD[u) = 


0 

+00 


if M G 2 ? 
otherwise. 


For 6 G Ai+, and 7 > 0, we denote the proximal operator associated with Problem Q as 


me-, a) Argmin \gain) + ^ ||u - 0 + 7(6' - 0 ^)||pl . 

u(^M+ I ^7 J 


( 6 ) 


For 0 < £ < ^jJ, we define 


M+{£,'ijj) = {0 G M+ : Amin(0) > £■, and Amax(0) < '</’} • 

3. A PROXIMAL GRADIENT ALGORITHM FOR PROBLEM Q 
We begin this section with a brief review of proximal gradient algorithms, following |Nes- 


terov 


(2013), which concerns the minimization of the following generic convex optimization 


problem: 


mm 


uj) /(w) + g{u: 


(7) 

I; g{-) is a closed convex 


where, Q is a convex subset of a Euclidean space with norm 
function and /(•) is convex, smooth on satisfying: 

||V/»-V/>')ll (8) 

for u,u' G and 0 < Z < 00 . The main ingredient in proximal gradient descent methods 
is the efficient computation of the proximal-operator (“prox-operator” for short), given 
by: 

(9) 


Tm) '= Argmin ||w - (w - 7V/(a;))f + g{oj), 


for some choice of 0 < 7 < 1/L. The following simple recursive rule: 

ujk+i = f^{ujk), k>l, 

for some initial choice of wi G O and 7 = 1/Z, then leads to a solution of Problem ([^ 
(See for example, Nesterov (2013)). 

Problem (j^ has striking similarities to an optimization problem of the form ([^, with 
/(•) = /(•), n = Af+ endowed with the Frobenius norm, g{-) = ga{-), and with given 
by ([^. However, the use of the proximal gradient algorithm for Problem ([^ presents some 
immediate challenges since: 
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• The gradient of the smooth component, namely, V f (9) = —6~^ +5 is not Lipschitz 
on the entire domain (as required in Q), due to the unboundedness of the 
map 9 I—)• 9~^. 

• The corresponding proximal map T^(- ; a) defined in Q need not be simple to 
compute. 


Our first task in this paper, is to show how each of the above problems can be alleviated. 
We note that Rolfs et al. (2012) also analyze a proximal gradient descent algorithm for the 
case a = 1. We present here a self-contained analysis: our proofs have some differences 
with that of Rolfs et al. (2012); and lays the foundation for the stochastic optimization 
scheme that we analyze subsequently. Loosely speaking, we will show that even if the 
function f{9) does not have Lipschitz continuous gradient on the entire feasible set A^+, 
it does satisfy ([^ across the iterations of the proximal gradient algorithm. In addition, 
we demonstrate that by appropriately choosing the step-size 7 , the proximal map T.y(- ; a) 
can be computed by “dropping” the constraint 0 G A1+. We formalize the above in the 
following discussion. 

For a £ [0,1], 7 > 0, and 9 £ M. (i.e., the set oipxp symmetric matrices), the proximal 
operator associated with the function ga is defined as 


Prox.y(0;Q;) Argmin 
ueM 


9a{u) -h ^ \\u-9\\l 


This operator has a very simple form. It is a matrix whose (i,j)th entry is given by: 


(Prox.^( 6 ';a)).^. 


' 0 

dij-aX"/ 

^ 1+(1—q)A7 

djj+aX'r 
. 1+(1—a)A7 


if \9ij\ < aAy 
if 9ij > oAy 
if 9ij < —oAy. 


( 10 ) 


For 7 > 0, and 9 £ M+, we consider a seemingly minor modification of the operator ([^, 
given by: 


Ty{9;a) 


== Argm\n(gaiu) + ^\\u-9 + 'y{S-9 ^)||p| 

uGM ^ ^ 

= Prox.^ (0 — 7 ( 5 * — 9~^)-, a) . 


( 11 ) 


Compared to (|^ , one can notice that in © the positive definiteness constraint is relaxed. 
It follows from (10) that T^{9]a) is straightforward to compute. Notice that if T^{9-,a) 
is positive definite, then T^{9; a) = T^{9; a). We will show that if 7 is not too large then 
indeed T^{9;a) = T^{9;a) for all 9 in certain subsets of AI+. 

At the very onset, we present a result which provides bounds on the spectrum of 9, a 
solution to Problem (^. The following lemma can be considered as a generalization of 
the result of Lu (2009) obtained for the Glasso problem (with a = 1). Let us define the 
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following quantities: Ai aX, X 2 (1 — a)A/ 2 , ^ Il'S'lb + Xip, 


L ' 


if a e [ 0 , 1 ) 


- ifa = l, 

Ui {p - WS) - 2pX2il) 


c{t) = 


1 


Ai(l-t) 


Ai||0(t)||i-tAiTV(0(t)) + A2||0(t)rF - 


Ai(l-t) 


(12) 


U 2 inf c{t), 
t£(0P) 


where, we take t G (0,1) and 6 {t) {S + tXiI) 

Lemma 1. If 6 is a solution to Problem Q with A > 0, then 9 is unique, and 9 G 
Al+(£*, V'ub); with tpuB = Hiiii{C^i) t^ 2 }, where, Ui,U 2 are as defined in (12). In other 
words, we have the following bounds on the spectrum of 9 

'^min {0) > 4, -^max {§) < 'IpUB- 

□ 


Proof. The proof is presented in Section 8.1 


We make a few remarks about the bounds in (12). 

• Computing U 2 requires performing a one dimensional minimization which can be 
carried out quite easily. Conservative but valid bounds can be obtained by replacing U 2 
by evaluations of c(-) at some values of t G (0,1) for example: t = ^ and t = 0+ (provided 
S is invertible). 

• Since the condition number of 9 is cond(0) = Amax(^)/Amin(^)) the result above implies 

that cond(0) < We note, however, that this upper bound may not be an 

accurate estimate of cond( 0 ). 

We now present an important property (Lemma of the proximal gradient update 
step, for our problem. Towards this end, we define v Amin (•S') — Aip and 


; 1 del 


-V+\J v2+8A2 
4A2 


1 
V 

+00 


if a G [0,1), 

if a = 1 and u > 0 

if a = 1 and u < 0 . 


It is obvious that 0 < li, < < 00 . We also dehne 

min + Vp i'fiuB - 4)) • 

Lemma 2. Take 7 G (0,1'*] and let {9j, j > 0} be a sequence such that 9j = T.y{9j-i; a). 


',n 


for j > 1. Ifdo G then 9j G Al+(1*, V’*) for all j > 0. 

Proof. See Section [ 8 ^ 


□ 


In the special case of Glasso {a = 1), the results of Lemma correspond to those 
obtained by Rolfs et al. (2012). Our proof, however, has differences since we rely more 
heavily on basic properties of proximal maps. 
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Lemma shows that for appropriate choices of 7 > 0, the two proximal maps and 
T-y produce the identical sequences that remain in the set This suggests that 

one can solve Problem Q using the proximal operator Tj, as the next result shows. 


Theorem 3. Fix arbitrary 0 < £ < ip < 00 . For k > 1, let {6j, 0 < j < k} be a 
sequence obtained via the map T^: Oj+i = Ty{0j;a), for some 7 G Suppose that 

6,9j G FAj^{£,'ip), 0 < j < k. Then 


Ok — 0 


< P 


Oo-e 


and 


tiPk) 4^c 


< 


O0-9 


27 


' mm 


-,P 


(13) 


where p = 1 — . 


Proof. See Section 8.3 


□ 


Remark 4. If £ = and ip = V’*, and 9q G M.j^[£.^,TtTm.{ipuB,'>P]f}), 
that 9, 9j G Ai-^-{£,'ip), 0 < j < k is redundant, as shown in Lemma 1 


then the assumption 
jji and (13) holds. □ 


An appealing feature of the iteration 9k+i = T.y{9k]a) is that its convergence rate is 
adaptive, i.e., the algorithm automatically adapts itself to the fastest possible convergence 
rate dictated by the condition number of 9. This is formalized in the following corollary: 


Corollary 5. Let 0 < < i/;** < 00 be such that Xmin{&) > and Xmax{d) < "0**. Let 

{9k, k > 0} be a sequence obtained via the map T-y: dj+i - T^ {9j-,a), for some'y £ (0,^**]. 
Ifliuik 9k = 9, then there exists ko > 0 , such that for all k > ko, 


9k-9 


< 1 - 


7 




k—ko 


^ko 


Proof. By assumption, 9 belongs to the interior of V’**)- Since 9k —)• 9, there 

exists ko > 0, such that 9k G -0**) for k > ko. Then we apply the bound (13), 

and the lemma follows. □ 


The analysis above suggests the following practical algorithm for Problem (§. Let { 7 fc} 
denote a sequence of positive step-sizes with lim^. 7 ^. = 0. An example of such a sequence 
is 'ik = 7o/ 2^, for some 70 > 0. For convenience, we summarize in Algorithm 1, the 
deterministic proximal gradient algorithm for Problem (|^. 

Algorithm 1 (Deterministic Proximal Gradient). 

Set r = 0. 

(1) Choose 9o G A1+. 

( 2 ) Given 9k, compute: 9k+i = i9k\ oP). 

(3) If Amin(^*fc+i) < 0, then restart: set A; •(— 0, r I'+lj and go back to (1). Otherwise, 
set k k + 1 and go back to ( 2 ). 


We present a series of remarks about Algorithm 

• Positive Definiteness. In Step 3, positive dehniteness is tested and the algorithm 
is restarted with a smaller step-size, if 9k+i is no longer positive dehnite. The smallest 
eigenvalue of 9k+i, i.e., Amin( 6 *fc+i) can be efficiently computed by several means: (a) it 
can be computed via the Lanczos process (see e.g. Golub and Van Loan (2013) Theorem 
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10 . 1 . 2 ); (b) it may also be computed as a part of the step that approximates the spectral 
interval of 6k+i using the procedure of Chen et al. (2011) (c) a Cholesky decomposition 
of Ok+i also returns information about whether 9k+i is positive definite or not. 

An efficient implementation of the algorithm is possible by making Step 3 implicit. For 
instance the positive definiteness of 0 fc+i can be checked as part of the computation of the 
gradient Vf{9k+i) = S — 9'j^^^ in Step 2 . 


• Step Size. If the initial step-size satisfies 70 < and 9o G V’*)) the algorithm is 

never re-initialized according to Lemmaj^ and Theoremholds. However, it is important 
to notice that Lemma and Theorem present a worst case analysis scenario and in 
practice the choice 70 = can be overly conservative. In fact, Corollary dictates that 
a better choice of step-size is 70 = Amin(^)^- Obviously Amin(^) is rarely known, but what 
this implies is that, in practice, one should initialize the algorithm with a large step-size 
and rely on the re-start trick (Step 3) to reduce the step-size, when 9^+1 is not positive 
definite. 


• Adaptive Convergence Rate. We have seen in Corollary that the convergence rate of 
the sequence {9^} improves with the iterations. This adaptive convergence rate behavior 
makes the cost-complexity analysis of Algorithm more complicated. However, to settle 
ideas, if we set 9o close to 9, and the step-size obeys 7 « Amin(^)^) Theoremj^and Corollary 
imply that the number of iterations of Algorithm needed to reach the precision e (that 


IS 


9h — 


<e) 


IS 


O 


V c 



O cond(0)^ loge^ . 


• Computational Cost. The bottleneck of Algorithm is the computation of the inverse 
0^^, which in general entails a computational cost of 0{p ^)—See Figure showing the 
computation times of matrix inversions for real symmetric p x p matrices, in practice. It 
follows that in the setting considered above, the computational cost of Algorithm [T| to 
achieve a e-accurate solution is O (p3cond(0)2 log(l/e)). 


4. Stochastic Optimization Based Algorithms 


When p is large (for example, p = 5,000 or larger), the computational cost of Algo¬ 
rithm 1 becomes prohibitively expensive due to the associated matrix inversions—this is 
a primary motivation behind the stochastic optimization methods that we develop in this 
section. For 9 G A4+, let Trg denote the density of N(O,0“^), the mean-zero normal dis¬ 
tribution on with covariance matrix 9~^. We begin with the elementary observation 
that 


9-^ 



This suggests that on AI+, we can approximate the gradient Vf{9) = S — 9 ^byS — 

N~^ Ylf=i where zi:N ng-, here, the notation zi:N denotes a collection of random 
vectors Zi,i< N. 

To motivate the stochastic algorithm we will first establish an analog of Lemma show¬ 
ing that iterating the stochastic maps obtained by replacing 9~\ in computing T.y(0j_i; a) 
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in (0 by the Monte Carlo estimate described above, produces sequences that remain pos¬ 
itive definite with high probability. Towards this end, fix 7 > 0; a sequence of (positive) 
Monte Carlo batch-sizes: {N^, k > 1}; and consider the stochastic process {6^, k > 0} 
defined as follows. First, we fix 6q G For A: > 1, and given the sigma-algebra 

Tk-i= 


generate 


i.i.d. 




10 


compute Sfc 


1 

Wk 


Nk 

E 

i=i 


ZjZj, 


and set: 

Ok = Prox.^ ( 6 »fc_i - 7 (5 - Sfc)). 
For any 0 < i < < 00 , we set 


-0) = inf {/c > 0 : 9k ^ M+{i, i/')} , 


(14) 


(15) 


with the convention that inf 0 = 00 . For a random variable T > .^, we define t(£, T) as 
equal to r(£, V') on {T = V’}- 

Given e > 0, we define /Xe = || 5'||2 -|- (Ai -|- e)p, 


4(6) 


del 




Me 


4 A 2 


if a G [0,1) 
if a = 1. 


Similarly, define Vg = Amin (<5) — (Ai -|- e)p, 


, 1 / s det 
W) = 


-^e + -v/i^|+8A2 

^ 4A2 

+ 00 


if a G [0,1), 

if a = 1 and t'g > 0 

if a = 1 and < 0. 


It is easy to check that 0 < i*(e) < 4 < V’* E '^★(f) < 00 . 

The following theorem establishes the convergence of the stochastic process 9k, produced 


via the stochastic optimization scheme (15). 


Theorem 6. Let {9k, k > 0} be the stochastic process defined by the rules (14 15). 
Fix e > 0. Suppose that 9o G A4+(G(6), min('0[/B, i/:*(e))). Then there exists a random 
variable 4'*(e) > G(e) such that 


P[T(4(e),T*(e)) 



If Ylj ^ then E('I'*(e)^) < 00 (hence 4'*(e) is finite almost surely), and on 

{t (4(e), ^*(e)) = 00 }, limfc^oo Ok = 9. 


Proof. See Section [ 8 ^ 


□ 
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Growth Condition on the Monte Carlo batch size. If we let the Monte Carlo sample size 
Nk increase as 


Nk > 


3 \ogp 


+ 


min (l,l'2(e)e2/16) 

for some g > 1, then N~^ < oo, and the bound in Theorem above, becomes 

P [r (4(e), T*(e)) = oo] > 1 - y, 


where p = exp (—a min(l, ^^(e)e^/16)j'?) < oo. Hence for high-dimensional prob¬ 

lems, and for moderately large Monte Carlo sample sizes, P [r (4(e),'h*(e)) = oo] can 
be made very close to one—this guarantees that positive definiteness of the process 
/c > 0} is maintained and the sequence converges to 0, with high probability. The 
convergence rate of the process is quantified by the following theorem: 


Theorem 7. Let {9k, k > 0} he the stochastic process defined by (I 4 15). 
0 < i < < -|-oo, suppose that 9o,6 G A^+(£, i/’), and 7 < 4 . Then 


For some 


E 


'■ {T{i,ip)>k} 


6k — 9 


< 1 - 


-02 


9o-9 


+ 27V-2{p + P^)^JV7 (i-^)‘ (16) 


Proof. See Section 8.5 


□ 


As with the deterministic sequence, the convergence rate of the stochastic sequence {9k} 
is determined by the condition number of 9. To see this, take 0 < 4* < V’** < 00 , such 
that 4* < Amin(^), and Amax(0) < V’**- It is easy to show that a conditional version of (16) 


holds almost surely: for 0 < ko < k, and for =" inf{A: > ko : 9k ^ 


def 


d e A 4 + (e** ,1/1** 





9k 


9 




< 



9ko - ^ 


2 

F 


+ 27'C*(p + J>') E ^7' 

j=ko+l 



k-ko-j 


(17) 


Therefore, as 9k ^ 9 almost surely, and since 9 G Al+(e**, V'**), one can find ko such that 
with high probability, and for all k > ko, the following holds: 




(18) 
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If we make the (strong) assumption that (18) holds with probability one, then one can 
deduce from (17) that for k> ko, 

k—ko 


E 


Ok-9 


< 1 - 


7 


ip': 


E 


■ 

2" 

1 

0 

0:5 

F 


E 


N. 


-1 


1 - 


_7_ 

',2 




k-ko-j 


(19) 


+ 27^4* (P + P^) 

j=ko+l 

which is an analogue of Corollary As in the deterministic case, this adaptive behavior 
complicates the complexity analysis of the algorithm. In the discussion below, we consider 
the idealized case where £** = ^mm{0), V'** = Amax( 6 '), and 9q G A 1 +( 4 *,iA**)- 

Implications of Theorem and choice of Nj. We now look at some of the implications 
of Theorem]^ and ( |19[ ) in the ideal setting where ko = 0. If Nj is allowed to increase as 

Nj = for some q > 0, then ^ ~ ^ A: 

implication of Theorem]^ and (19) is that, as k 


00 ; then the 


00 , 


E 


Ok-e 


= o 


i - 


7 






7^fc 


= 0{p^ + 


'yki 


( 20 ) 


with p = 1 — Notice that the best choice of the step-size is 7 = Setting 7 = it 


follows that the number of iterations to guarantee that the left-hand side of ( 20 ) is smaller 
than e G (0,1) is 

log(e“^^ 

I \ / 




V 


log(p-i)’ 


( 21 ) 


€ 


where p = 1 — and a V 6 = max(a, b). This implies that in choosing Nj = \N + , 

one should choose q > 0 such that 


£ 


log(e 


- 1 ^ 


log(p-i) 


= O 




-^log(e ) ) = O (cond(6') log(e 


-1 


( 22 ) 


where cond(0) = Amax(^)/Amin(^) is the condition number of 6 . Incidentally, (22) shows 
that one should choose (7 > I, as also needed in Theorem 

The results developed above suggest the following stochastic version of Algorithm 
As above, let { 7 ^, A: > 0} be a sequence of positive step-sizes decreasing to zero, and let 
{Nk, A: > 0} be a sequence of Monte Carlo sample sizes. That is, Nk is the number of 
Monte Carlo sample draws from vre^ at iteration k. Algorithm is summarized below: 

Algorithm 2. 

Set r = 0. 

(1) Choose Oq G M.+. 


(2) Given 6 k, generate zi:Nk 


i.i.d. 


TTg^,, i.e., the density of N(0, ), and set 

, Nk 
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(3) Compute 

9k+i = Prox^^ {Ok - 7r(5 - Sfc+i); a). 

(4) If Amin(^fc+i) < 0) then restart: set A: ■(— 0, r ■<— r+1, and go back to (1). Otherwise, 
set A: ■<— A: + 1 and go to (2). 

Remark 8 . As in Algorithm the actual implementation of Step 4 can be avoided. For 
instance if the simulation of the Gaussian random variables in Step 2 uses the Cholesky 
decomposition, it returns the information whether Amin(^fc+i) < 0. In this case, we restart 
the algorithm from 9q (or from 9k), and with a smaller step-size, and a larger Monte Carlo 
batch size. □ 

4.1. Sampling via dense Cholesky decomposition. The main computational cost 
of Algorithm 2 lies in generating multivariate Gaussian random variables. The standard 
scheme for simulating such random variables is to decompose the precision matrix 9 as 

9 = B!R, (23) 

for some nonsingular matrix R G Then a random sample from N(0, 9~^) is obtained 

by simulating u ~ N(0, Ip) and returning R~^u. The most common but remarkably effec¬ 


tive approach to achieve the above decomposition (23) is via the Cholesky decomposition. 


which leads to R being triangular. This approach entails a total cost of 0{p^m + p^/3) 
to generate a set of m independent Gaussian random variables and computing the outer- 
product matrix, which forms an approximation to 9~^. The term p^/3 accounts for the 
cost of the Cholesky decomposition; and p'^m accounts for doing m back-solves R~^Ui for 
m many standard Gaussian random vectors Ui,i = 1,... ,m] and subsequently computing 
m each back-solve R~^Ui can be performed with 0 {p^) 

cost since R is triangular. This shows that an iteration of Algorithm implemented via 
Cholesky decomposition, is more cost-effective than an iteration of Algorithm if the 
number of Gaussian random samples generated in that iteration is less than p. Since A:* 
iterations (as defined in (|21[)) are needed to reach the precision e, and Nk = A^ -|- A:^ (we 
assume that q is chosen as in (|22[)), we see that the number of samples per iteration of 
Algorithm remains below p, if p > cond(0)^e“^. In this case the overall computational 
cost of Algorithm to obtain a e-accurate solution is 

.log(e-i)' 


O 


P 


= O (p'^cond(0)"^ log(e 


-1 


log(/9-i) 

We caution the reader that, on the surface, the above cost seems to be of the same 
order as that of the deterministic algorithm (Algorithm [^, as seen from Theorem 
However, the constants in the big-0 notation differ, and are much better for the Cholesky 
decomposition than for inverting a matrix—see Figure for a compelling illustration 
of this observation. In addition, as the problem sizes become much larger (i.e., larger 
than p ~ 35, 000) matrix inversions become much more memory intensive than Cholesky 
decompositions; leading to prohibitely increased computation times—see Figure 

4.2. Sampling via specialized sparse numerical linear algebra methods. As an 

alternative to the above approach, note that equation (23) is also solved hy R = 0^/^. 
If 9 is sparse and very large, specialized numerical linear algebra methods can be used 
to compute 9~^/‘^b for a vector or matrix h, with matching dimensions. These methods 





PRECISION MATRIX COMPUTATION VIA STOCHASTIC OPTIMIZATION 


17 


include Krylov space methods (Hale et al. 


trix function approximation methods (Chen et al 


(2008); Eiermann and Ernst (2006)), or ma- 
(2011)). These methods heavily exploit 
sparsity and typically scale better than the Cholesky decomposition when dealing with 
very large sparse problems. Eor instance, the matrix function approximation method of 
Chen et al. (2011 [ ) has a computational cost of 0{m{p + Cp)) to generate a set of m 
samples from N(O,0“^), where Cp is the cost of performing a matrix-vector product 9b 
for some b £ As comparison. Figure shows the time for generating 1,000 random 
samples from N(O,0“^), using dense Cholesky factorization, and using the matrix func¬ 
tion approximation approach of ( |Chen et al.| (2011)), for varying values of p. The value 
of p around which the matrix approximation method becomes better than the Cholesky 
decomposition depends on the sparsity of 6 , and the implementations of the methods. 
These specialized sparse methods, however, need to be used with caution. For one thing, 



p 


(Zoomed) 



P 


Figure 2. Figure showing the times in seconds to generate 1,000 Gaussian ran¬ 
dom samples from N(O,0“^), where 9 G is constructed as explained in 

Section 7.1.1 with the proportion of non-zeros entries approximately set at 5/p. 


these methods are quite sensitive to the sparsity level of the iterates 9^, and ultimately 
to the sparsity level 9, the solution to Problem i) — the methods are useful only when 
the solutions are sufficiently sparse. This behavior should be contrasted to that of dense 
Cholesky decomposition based methods, which are less sensitive to the sparsity level of 9. 
Based on our experiments (not reported here), we recommend the use of dense Cholesky 
decomposition methods in the initial stages of the algorithm, when the iterates 9k are 
relatively dense. As the number of iterations progresses and the estimates become more 
sparse, we recommend the use of specialized sparse numerical linear algebra methods for 
sampling from the Gaussian distributions. Since the use of dense Cholesky decomposi¬ 
tion methods amply substantiates the main message of our paper—the effectiveness of 
stochastic gradient methods as a computationally scalable alternative to their determinis¬ 
tic counterparts, our experimental results reported in Section focus on dense numerical 
linear algebra methods. 

4.3. Borrowing information across iterations. A main limitation of Algorithm is 
that at each iteration k, all the Monte Carlo samples used to estimate 9//^ are discarded. 
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and new samples are generated to approximate We thus ask, is there a modified 

algorithm that makes clever use of the information associated with an approximate 9^^ to 
approximate In this vein, we propose herein a new stochastic algorithm: Algorithm ^ 

which recycles previously generated Monte Carlo samples in a novel fashion, to update its 
approximation for := from := 0 ^^. 

This new algorithm relies on the following algorithm parameters (a) N, where > 1 
is a given integer, and (b) {Cfc, k>l} which is a sequence of positive numbers such that 

2 


J2Ck = oo, and ^C| 


< oo. 


(24) 


k>l 


k>l 


The algorithm is summarized below: 

Algorithm 3. Set r = 0. 

(1) Choose 00 G A4+, and Sq G AI+. 


(2) Given 6k, and generate zi:N = N(0, 9j^ ^), and compute 


/ I N ^ \ 

^fc+l ~ T Cfc+l ( ^ ^ ^ ^k^k I • 


(25) 


k=l 


(3) Compute 

0 fc+i = Prox^^ (0fc - 7r(5 - Sfc+i); a) ■ (26) 

(4) If Amin(^fc+i) < 0, then restart: set A: ■<— 0, r ■<— r+1, and go back to (1). Otherwise, 
set A: ■<— A: + 1 and go to (2). 

Notice that in Algorithm the number of Monte Carlo samples is held hxed at N. 
Hence its cost per iteration is constant. 

Algorithm is more difficult to analyze because the two recursive equations (25) and 
are intimately coupled. However, the next result gives some theoretical guarantees 
by showing that when the sequence {6k, k >0} converges, it necessarily converges to the 
minimizer of Problem ([^, i.e., 0. 

Theorem 9. Let {6k, k >0} be the stochastic process generated by Algorithm^ where, 
the sequence {Cfe} satisfies (24)- Fix < I < < oo. Suppose that 0,0o G A4+(£, V’); and 

7 < Then, on the event 

{T{i,fi) = +oo, and {6k} converges}, 

we have that limfc_,.oo 6k = 6. 


Proof. See Section 8.6 


□ 


5. Exact Thresholding into connected components 


As mentioned in Section the exact covariance thresholding rule (Mazumder and 


Hastie, 2012a), originally developed for the Glasso problem plays a crucial role in the 


scalability of Glasso to large values of p, for large values of A. One simply requires that 
the largest connected component of the graph ((l(|sij| > A))), is of a size that can be 
handled by an algorithm for solving Glasso of that size. In this section, we extend this 
result to the more general case of Problem ([^. 
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Consider the symmetric binary matrix £ := with £ij = l(|sjj| > aA), which 

defines a graph on the nodes V = p}. Let {Vj,£j),j = denote the J 

connected components of the graph Let 6 he a minimizer of Problem ([^ and 

consider the graph £ induced by the sparsity pattern of 9, namely, £ij = 1(|%| 7 ^ 0). Let 
the connected components of {V,£) be denoted by (Vj,£j),j = The following 

theorem states that these connected components are essentially the same. 


Theorem 10. Let {Vj,£j),j = and {Vj,£j),j = denote the connected 

components, as defined above. 

Then, J = J and there exists a permutation 11 on {1,..., J} such that Vn(j) = Vj and 
£uij) = £j for all j = 1,..., J. 


Proof. See Appendix, Section 8.7 for the proof. 


□ 


Note that the permutation 11 arises since the labelings of two connected component 
decompositions may be different. 

is appealing because the connected components of the graph £ij = l(|sij| > 


Theorem 


10 


a A) are fairly easy to compute even for massive sized graphs—see also Mazumder and 


Hastie (2012a) for additional discussions pertaining to similar observations for the Glasso 


problem. A simple but powerful consequence of Theore m |10| is that, once the connected 
components {Vj,£j),j = 1,... ,J are obtained. Problem (1^ can be solved independently 
for each of the J different connected component blocks. In concluding, we note that 
Theorem 10 is useful if the maximum size of the connected components is small compared 
to p, which of course depends upon S and A, a. 


6. Special Case: Ridge regularization 

In this section, we focus our attention to a special instance of Problem ([^, namely, 
the ridge regularized version, i.e.. Problem ([^ for some value of A > 0. Interestingly, 
the solution to this problem can be computed analytically as presented in the following 
lemma: 


Lemma 11. Let S = UDU' denote the full eigendecomposition of S where, D = diag(di ,... ,dp 
For any A > 0 and a = 0 the solution to Problem (© is given by: 9 = Udiagfa)U', where, 
diag(a) is a diagonal matrix with the ith diagonal entry given by 


(Ti = 


—di + + 4A 


2A 


for i = l,...,p. 


Proof. For the proof see Section 8.8 


□ 


We make the following remarks: 

• Performing the eigen-decomposition of S is clearly the most expensive part in 
computing a solution to Problem ([^; for a general real p x p symmetric matrix 
this has cost 0{p^) and can be significantly more expensive than computing a 
direct matrix inverse or a Cholesky decomposition, as reflected in Figure 

• When p ^ n and n is small, a minimizer for Problem ([^ can be computed for 

large p, by observing that S = ^ Yffi=i eigendecomposition 
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of S can be done efficiently via a SVD of the n x p rectangular matrix X with 
0 {Tn?p) cost, which reduces to 0 {p) for values oi p^ n with n small. 

However, computing the solution to Problem Q becomes quite difficult when both 
p and n are large. In this case, both our stochastic algorithms: Algorithms and 
are seen to be very useful to get an approximate solution within a fraction of the 
total computation time. Section 7.2 presents some numerical experiments. 


7. Numerical experiments 

We performed some experiments to demonstrate the practical merit of our algorithm 
on some synthetic and real datasets. 

Software Specifications. All our computations were performed in Matlab (R2014a (8.3.0.532) 
64-bit (maci64)) on a OS X 10.8.5 (12F45) operating system with a 3.4 GHz Intel Core i5 
processor with 32 GB Ram, processor speed 1600 MHz and DDR3 SDRAM. 


7.1. Studying sparse problems. 


7.1.1. Simulated data. We test Algorithms an d[3]with p = 10^, 5 X 10^, and p = 10^ for 
some synthetic examples. The data matrix S G is generated as S' = n~^ Z]j=i 

where n = p/ 2 , and Xi-n Np(O, 0 T^)) for a “true” precision matrix 0 * generated as 
follows. First we generate a symmetric sparse matrix B such that the proportion of non¬ 
zeros entries is 10/p. We magnified the signal by adding 4 to all the non-zeros entries of 
B (subtracting 4 for negative non-zero entries). Then we set 9^, = B + (i — Amin mip, 
where Amin(.H) is the smallest eigenvalue of B, with i = 1. 

Given S, we solve Problem with a ^ 0.9 and A oc ^yiog{p)/n such that the sparsity 
(i.e., the number of non-zero^ of the solution is roughly 10/p. In all the examples, 
we ran the deterministic algorithm (Algorithm 1) for a large number of iterations (one 
thousand) with a step-size 7 = 3.5 to obtain a high-accuracy approximation of 6 , the 
solution to Problem Q (we take this estimate as 6 in what follows). Algorithms 1, 2 and 
3 were then evaluated as how they progress towards the optimal solution 9 (recall that the 
optimization problem has a unique minimizer), as a function of time. All the algorithms 
were ran for a maximum of 300 iterations. Further details in setting up the solvers and 
parameter specifications are gathered in Section]^ (appendix). To measure the quality of 
the solution, we used the following metric: 

Relative Error = \\9k — 0||_f/||0||_f, 


as a function of the number of iterations of the algorithms. Since the work done per itera¬ 
tion by the different algorithms are different, we monitored the progress of the algorithms 
as a function of time. The results are shown in Figure]^ 

We also compared the performance of our algorithms with the exact thresholding scheme 
(Section switched “on” — this offered marginal improvements since the size of the 
largest component was comparable to the size of the original matrix — see Section for 
additional details on the sizes of the connected components produced. We also compared 


our method with a state-of-the algorithm: QuiC (Hsieh et al., 2014), the only method that 


seemed to scale to all the problem sizes that have been considered in our computational 
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experiments. We used the R package QUIC, downloaded from CRAN for our experiments. 
The results are shown in Table [TJ 

We note that it is not fair to compare our methods versus QuiC due to several rea¬ 
sons. Firstly the available implementation of QuiC works for the Glasso problem and 
the experiments we consider are for the generalized elastic net problem Q. Furthermore, 
QuiC is a fairly advanced implementation written in C++, whereas our method is im¬ 
plemented entirely in MATLAS. In addition, the default convergence criterion used by 
QuiC is different than what we use. However, we do report the computational times of 
QuiC simply to give an idea of where we are in terms of the state-of-the art algorithms 
for Glasso. Towards this end, we ran QuiC for the Glasso problem with A = a A for 
a large tolerance parameter (we took the native tolerance parameter, based on relative 
errors used in the algorithm QuiC by setting its convergence threshold (tol) as 10“^^), the 
solution thus obtained was denoted by 6. We ran QuiC for a sequence of twenty tolerance 
values of the form 0.5 x 0.9^ for r = 1,..., 20; and then obtained the solution for which the 
relative error \\0r — 0||f/||0||f < Tol with Tol G {0.1,0.02}. For reference, the times taken 
by QuiC to converge to its “default” convergence threshold (given by its relative error 
convergence threshold: tol= 10“'^) were 501 seconds for for p = 5 ,000 and 3020 seconds 
for p = 10 , 000 . 


7.1.2. Real dataset. The Patrick Brown dataset is an early example of an expression ar¬ 
ray, obtained from the Patrick Brown Laboratory at Stanford University and was studied 

(2012a|. There are n = 385 patient samples of tissues from 


m 


Mazumder and Hastie 


various regions of the body (some from tumors, some not), with gene-expression measure¬ 
ments for p = 4718 genes. For this example, the values of the regularization parameters 
were taken as (a, A) = (0.99,0.16). Here, splitting led to minor improvements since the 
size of the largest component was 4709, with all others having size one. We report the 
performance of our methods without using the splitting method. We computed 9 by run¬ 
ning the deterministic algorithm for 1000 iterations, using a step-size 7 = 5 x 10“^. Unlike 
the synthetic experiments, in this case, we considered relative changes in objective values 
to determine the progress of the algorithm, namely, {(j)a{dk) — ^a)/\4>a\, where, we define 
0 a = 0 a(^); and recall that 0 a(') is defined in ([^. 

In this case, we also compared our method with QuiC but the latter took a very long 
time in converging to even a moderate accuracy solution, so we took the solution delivered 
by its default mode as the reference solution 6. QuiC took 5.9 hours to produce its default 
solution. Taking the objective value of this problem as the reference, we found that 
QuiC took 6080.207 and 10799.769 secs to reach solutions with relative error 0.74 and 
0.50 respectively. We summarize the results in Table 

Our empirical findings confirm the theoretical results that for large p, the stochastic 
algorithms reach low-accuracy solutions much faster than the deterministic algorithms. 
We also see that the splitting rule helps, as it should — major improvements are expected 
if the sizes of the connected components are significantly smaller than the original problem. 
The sparsity plot (Figure]^ shows that the solution provided by Algorithmtends to be 
noisy. The averaging step in estimating 6'j^^ in Algorithm makes these estimates much 
smoother, which results in solutions with good sparsity properties. 









22 


YVES F. ATCHADE, RAHUL MAZUMDER, AND JIE CHEN 


Evolution of Relative Error of Algorithms 1—3 versus time 
p = 1000 p = 5000 p = 10, 000 





Evolution of Sparsity of Algorithms 1—3 versus time 
p = 1000 p = 5000 p = 10, 000 





Algorithm 1 
Algorithm 2 
Algorithm 3 




Algorithm 1 
Algorithm 2 
Algorithm 3 


Time (secs) 


Time (secs) 


Time (secs) 


Figure 3. Evolution of relative error [top panel] and sparsity [bottom panel] 
of Algorithms [I][^ versus time (in secs); for three differ ent pr oblem sizes: p G 
{10^,5 X 10^, for the examples described in Section 7.1.1 We observe that 
for larger values of p > 5 x 10^, the new stochastic algorithms proposed in this 
paper: Algorithms 2 and 3 reach moderate accuracy solutions in times significantly 
smaller than the deterministic counterpart: Algorithm 1. Algorithm 3 reaches a 
low accuracy solution quicker, but is dominated by Algorithm 2 in obtaining a 
solution with higher accuracy. For small values of p (p = 1000) the different 
algorithms are comparable because direct matrix inversions are computationally 
less expensive, the situation changes quickly however, with larger values of p (See 
also Figure [^. 


7.2. Studying dense problems. We performed some experiments to demonstrate the 
performance of our method on dense inverse covariance estimation problems. Here, we 
took a sample of size n = p with p £ {10'^, 1.5 x 10^}, from a Gaussian density with 
independent covariates and mean zero. As described in Section it is indeed possible 
to obtain a closed form solution to this problem, but it requires performing a large scale 
eigen-decomposition on S, which can be quite expensive. In this application, proximal 
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Time (in secs) taken by algorithms 

Accuracy Algorithm 1 Algorithm 2 Algorithm 3 QuiC 


Tol 

No Splitting 

With Splitting 

No Splitting 

With Splitting 

No Splitting 

With Splitting 





p = 

: 5000 




10-1 

125.78 

122.16 

62.73 

60.84 

56.78 

53.31 

300.94 

2 X 10“^ 

251.92 

241.49 

161.18 

142.83 

292.34 

271.67 

350.29 




P = 

10,000 




10-1 

921.52 

612.11 

317.35 

155.33 

289.58 

192.28 

2046.73 

2 X 10-2 

1914.65 

1305.65 

766.69 

463.66 

647.27 

563.42 

2373.03 


Table 1. Table showing the times (in secs) to reach an Accuracy of “Tol” for dif¬ 
ferent algorithms, where, Accuracy refers to \\dk ~ ^I|f/|1^||f- Algorithms 2 and 3 
clearly shine over the deterministic method (Algorithm 1) for delivering moderate 
accuracy solutions. Algorithm 3 reaches a solution of moderate accuracy faster 
than Algorithm 2 and Algorithm 1; for smaller values of “Tol” Algorithm 2 wins. 
Here, splitting, which refers to the notion of covariance thresholding described in 
Section]^ is found to help, though not substantially — the regularization param¬ 
eters in this problem lead to connected components of sizes comparable to the 
original problem. The timings of QuiC are shown for reference purposes only, 
to get an idea of the times taken by state-of-the art algorithms. For reference, 
the times taken by QuiC to converge to its default convergence criteria were 501 
seconds for p = 5, 000 and 3020 seconds for p = 10, 000. 


gradient algorithms and in particular the stochastic algorithms presented in this paper, 
become particularly useful. They deliver approximate solutions to Problem in times 
that are orders of magnitude smaller than that taken to obtain an exact solution. 

In the experiments considered herein, we found the following scheme to be quite useful. 
We took a subsample of size m <C n from the original n samples and solved Problem ([^ 
with a covariance matrix obtained from that subsample. This is indeed quite efficient 
since it requires computing the SVD of an m x p matrix, with m <C p. We took the 
precision matrix and the covariance matrix associated with this subsample as a warm- 
start to the deterministic proximal gradient method, i.e.. Algorithm 1 and Algorithm 2 . 
This was seen to improve the overall run-time of the solution versus an initialization with 
a diagonal matrix. 

We summarize our results in Table For the case p = 10,000 our Monte Carlo batch 
size was of Nf^ = 1,000 -|- and we took 7 = O.l/A^^^^C'S')- The algorithms were 

warm-started with the solution of Problem ([^ for a subsample of size m = 100, which 
took 0.1 seconds to compute. For the case, p = 15, 000 we took = 2,000 -|- and 7 
as before. As a warm-start we took the solution of Problem (§ with a subsample of size 
m = 500 which was obtained in 1 second. 

8. Proofs 

This section gathers the proofs and technical details appearing in the paper. 


8.1. Proof of Lemma [H 
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Accuracy 

Tol 

Time (in 

Algorithm 1 

secs) taken by algorithms 

Algorithm 2 Algorithm 3 

0.1 

881.995 

366.864 

451.337 

0.02 

2030.405 

942.924 

> 654 


Table 2. Results on the Patrick Brown microarray dataset (here, n = 385 and 
p = 4718). Algorithm 3 reached a solution of relative accuracy 0.06 within the 
first 500 iterations which took a total time of 654 seconds. Here, we use “Accu¬ 
racy” to denote the relative error: [4>a{Qk) — 4‘a)l\4’a\, where, (j>a is the optimal 
objective value for the problem. For comparison, QuiC for the same dataset when 
set to optimize the corresponding graphical lasso problem with the same tuning 
parameter, took 5.9 hours to converge to a solution with the native (default) tol¬ 
erance criterion. Taking the objective value of this problem as the reference, we 
found that QuiC took approximately, 6080 secs (~ 1.7 hours ) and 10800 secs 
(^ 3 hours) to reach solutions with relative errors 0.74 and 0.50 respectively. The 
algorithms presented in this paper show impressive performance for the particular 
tasks at hand. 


Accuracy Time (in secs) taken by algorithms 

Tol p Algorithm 1 Algorithm 2 


0.1 

10 "^ 

15.42 

4.67 

0.05 

10 ^ 

93.46 

48.490 

0.1 

1.5 X lO-^ 

50.78 

15.70 

0.05 

1.5 X 10^ 

408.87 

176.11 


Table 3. Results for ridge regression. Here, we use “Accuracy” to denote the mea¬ 
sure: — (j^a)l\4'a\^ where, 4>a is the optimal objective value for the problem. 

For p = 10,000 computing the exact solution (using a full eigen-decomposition) 
took 140 secs, for p = 15,000 the exact solution was computed in 500 secs. Both 
Algorithms 1 and 2 obtained approximate solutions in times significantly smaller 


than computing the exact solution to the problem. For details see Section 7.2 


Proof. 


Uniqueness of 6: 

If A 2 > 0 then Problem ([^ is strongly convex due to the presence of the quadratic 
regularizer, hence 


IS 


Spectral bounds on 9: 

Consider the stationary conditions of Problem ([^ : 


Banerjee et al. 

(2008) 

Lu 

(2009 
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where, we use the notation: Z = sign(0), Ai = a\ and A 2 = (1 — a)A/2. It follows 
from (27) that 

0-^ -2A20 =5 + AiZ 


<||5 + AiZ||2l 
<(||5||2 + Ai||Z||2)I 

< (|| 5||2 + Xip) I (since, Zij G [- 1 , 1 ] implies \\Z \\2 < p) 


(28) 


If cJi’s denote the eigen-values of 0 then it follows from (28): 

l/cji - 2A2cri < ||5'||2 + Xip = p. 

Using elementary algebra, the above provides us a lower bound on all the eigen-values of 
the optimal solution 9\ ai > {—p + \/+ 8 A 2 )/( 4 A 2 ) for A 2 7 ^ 0, for alH = 1,... ,p. Note 
that for the case, A 2 = 0 we have ai>l/p for all i. Combining these results we have the 
following: 

ifA 2 /o 

- otherwise, 

. fi ’ 


Amin(0) > L ■■ = 


which completes the proof of the lower bound on the spectrum of 9. 

We now proceed towards deriving upper bound on the eigen-values of 9. 


From (27) we have: 

0 = {§, -9-^ + S + X1Z + 2X29) 


Xi\\9\\i=p-{9,S)-2X2 


Now observe that: 


(0,5) > A„,in(0>r(5) and 


>pxiM. 


(29) 


(30) 


We use £* as a lower bound for Amin(0) and use (30) in ( [^ to arrive at: 

||0||i < (p - LTt{S) - 2pX2il) := Ui 

Al 


(31) 


The above bound can be tightened by adapting the techniques appearing in Lu (2009) 
for the special case A 2 = 0; as we discuss below. Let 9{t) := (5 -|- tAiI)“^ be a family of 
matrices defined on t G (0,1). It is easy to see that 


9{t) G Argmin {— logdet(0) -|- (5 -|- tAiI, 0)} , 
e 


which leads to 


— log det(0(t)) -|- (5 -|- tAiI, 9{t)) < — log det(0) -|- (5 -|- tAiI, 0) 
-logdet(0) + (5,0) + Ai||0||i + A 2 0 ' < - logdet( 0 (t)) + (5, 0 >)) ( 32 ) 


+Ai||0(t)||i -l- A 2 9{t) 
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where, the second inequality in (32) follows from the definition of 9. Adding the two 
inequalities in (|32l) and doing some simplification, we have: 


Ai||0(t)||i-tAiTV(0(t)) + A2 m 


— A 2 


> Ai||0||i -tAiTr(0) > (Ai -tAi 


where, the rhs of the above inequality was obtained by using the simple observation 
Tr(0) < ||0||i. Dividing both sides of the above inequality by Ai — tXi we have: 

2 


Ii < 


1 


Ai(l-t) 


Ai||0(t)||i-tAiTr(0(t)) + A 2 e{t) 


A 2 


:=a{t) 


Al(l - 1 ) 

' -V-" 

:=b{t) 


(33) 


Observing that 9 > i^p and applying it to (33) we obtain: 

||0||i < (a(t) - 

where, a{t) = (^Ai||0(t)||i - t\iTT{9{t)) + A 2 9{t) and b{t) := 

Inequality (34) in particular implies: 


(34) 


1 1 < inf 
* 6 ( 0 , 1 ) 


a{t) - 


■.= U2 


(35) 


where, the minimization problem appearing above is a one dimensional optimization and 
can be approximated quite easily. While a closed form solution to the minimization 
problem in ( [^ may not be available, ||0||i can be (upper) bounded by specific evaluations 
of a{t) — b{t) at different values of t G (0,1). In particular, note that if S is invertible then, 
taking t ^ 0+ we get: 


< (lis-‘lli + h||s-‘||^ 


X2f 


LB 


Al 


otherwise, taking t = 5 leads to: ||0||i < a(4) — 6(4). 


Combining (31) and (35), we arrive at the following bound: 

||0||i < min{t/i, U 2 } 


(36) 


Now observe that: 

A. 


■XO) ■■= 


I 2 < 


< ||6»||i < min{t/i, U 2 } := i^uB- 


□ 


8.2. Proof of Lemma 

Proof. First Part: Lower bound on Amin(0j) 


Set 9 = 9 — 'y{S — 9 ^). By definition, T.y{9;a) = Argmin^g^ Qaiu) + ^ ~ ^||p 

By the optimality condition of this optimization problem, there exists Z G A4 in the sub¬ 
differential of the function 9 1 —)• ||0||i at T.y{9]a) such that Zij G [—1,1], {Z,T.y{9;a)) = 
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\T^{0-, a)||i, and 


^(r^(0; a)-d) + aXZ + (1 - a)\T^{e] a) = 0. 


The fact that Zij E [—1,1] implies that \\Z \\2 < \\Z\\^ < p. Hence, 

^m\n{Z) ^ p, and Amax(-^) ^ P* 


(37) 

(38) 


Using 9 = 6 — 7(5 — 9 ^), we expand (37) to 


Ty{9] a) = (1 + (1 - a)A 7 )"^ {9 - 7(5 - 0"^ + aXZ)) . 

We write 9 — 7(5 — 9~^ + aXZ) = 9 + 7 ^“^ — 7 ( 6 ' + aXZ). We will use the fact that 
for any symmetric matrices A,B, Amin(^ + B) > Amin(^) + '^min(7?)) Amax(^ + B) < 


Amax(^) + Amax(-B) (see e.g. Golub and Van Loan (2013) Theorem 8.1.5). In view of (38) 
we have: 

Amin {0 - 7(5” - 9~^ + aXZ)) > Amin(6' + 7^“^) - 7 (Amax(5') + aXp) . (39) 

Notice that the function x i-A x + ^ is increasing on [^/ 7 , 00 ), and by assumption 
(■-k > \/ 7 - Therefore, if Amin(^) > Gj we use the eigen-decomposition of 9 to conclude that 

Amin {P + 7^ = -^min(^) + T-> G + ■^- (40) 


Hence 




^* + 1-- 7(^max(5') aXp) 


Amin a)) > (1 + (1 - a)X'y) 

where the last equality uses the fact that 7* satisfies the equation 

(1 — a)A7^ -|- (Amax(<S') -|- aXp)ii, — 1 = 0. 


= 4, 


(41) 


Second Part: Upper bound on Amax(%) 

We will first show that if ipl < p’uB, then Amax(0j) < for all j > 1. Following 
arguments similar to that used to arrive at (39), we have: 


Amax (6* - 7(‘S'- 0 ^ + oAZ)) < Amax(6'+ 7^ ^) - 7 (Amin (S') - oAp) . (42) 

Using Amax(7*) < V’*; and following arguments used to arrive at (40), (41) we have: 


Amax {G + 7^ ^) — Amax(^) + 


A 


Amax(6') P>1' 




= Pi, 


Hence 

Amax (Fy(6'; a)) < (1 (1 - q;)A 7 )' 

where the last equality uses the fact that when 'ipl < oo, it satisfies the equation 

(1 - a)Xipl + (Amin(S) - aXp) - 1 = 0. 


Pl + ^ - 7(Amin(S) - aXp) 

Pi 
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We now consider the case where, ■0] > 'ipjjB, and 9o G The first part 


of the proof guarantees that 9j G Wf+(£*,+oo) for all j > 0. For j > 1, by Lemma 14 


applied with i = ip = + 00 , 9 = and H = 9-^^, we get 


< 


-1-9 


L-i 


This implies that for any j > 1, 


Pjh < \\eh + \\9j-9h 


9n-9 


< \m 2 + 

< 'Pub +'/pi'PuB - ^*)- 

where the last inequality uses Weyl’s inequality since 9o,9 G A4+(£*, V'ub)- 


□ 


8.3. Proof of Theorem [S]. 


Proof. We follow closely the proof of Theorem 3.1. of Beck and Teboulle (2009). Suppose 
that the sequence {9i, 0 < i < k} belongs to For any i > 0, since 0j+i = 

Prox.y(0j — 7(5 — 9~^); a), we apply Lemma 


14 


with H = 9- ^ to obtain 


9^+l - 9 

which implies that 

‘^l[Pa{0k) - 4>i 


< 27 ( 0 a( 6 »i+i) - pa{0) ) + Oi+l - S 




9,-9 


+ 


9k - 9 




9o-9 


Again, from (59), we have 


Pa{9i+l) - Pa{9) < ^ 

27 

We then sum for f = 0 to fe — 1 to obtain 

k 


9,-9 


9i+i — & 




+ 


2=1 


9k - 9o 


< 


9o-9 


We now use Lemma [13] to write 


1 


ga{9i+i) - ga{9i) < - {9i - 9i+i,9i+i - [9i - j{S - 9^^))), 

= -^ \\9,+i - 9,\\l + (9i - 0i+i, 5 - 9 -^}. 


This last inequality together with (55) applied with 9 = and 9 = 9i, yields 
{Pa{9,+l) - Pa{9)} < {Pa{9i) - Pa{9)] - ^ \\9i - 9i+i\\l . 


(43) 


(44) 


( 45 ) 
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By multiplying both sides of the last inequality by i and summing from 0 to A: — 1, we 
obtain 


k ^(j)a{0k) - - 4>c 

i=l 

k 

- ^ {MOi) - (pc 


- k—1 . 

i=0 ' 


%+l\\l 


i=l 


Hence, given (44), we have 


\^(pa{dk) - (pa{d)'j 


< 


2^k 

which together with (43) yields the stated bound. 
8.4. Proof of Theorem [G]. 

Proof. Write r, = T{^.^{e),'^pl{e)). 


00-0 


□ 


[r^ = oo] = 1 - ^ P [r^ = j] , 
i=i 


and 


IP [T-e = j] = IP [(Amin(6'j) < 4(e) Or Amax(4) > V’*(e)) , > j - l] . 

Now we proceed as in the proof of LemmaGiven 0j-i, the optimality condition (37) 
becomes: there exists a matrix Aj, all entries of which belong to [— 1 , 1 ] (that can be taken 
as sign ( 4 )), such that 

0, = (! + (!- a)A7)-' (0J-1 +10-\ - i{S + {0f}^ - G,) + oAA,)) . 

As in the proof of Lemma we have. 


Amax(<5 + i0j\ — Gj) + AAj) < Amax(5') + p\\0jfi — Gj||oo + pA 


-1 


and Amin('S' + {0j-i — Gj) + AAj) > Amin(*S') — p\\0jfi — Gj||oo — pA. 


-1 


del 


where for A G A4 , PIloo = maxij |Ajj|. Therefore, with the same steps as in the proof 
of Lemma j2, we see that on the event {t > j — 1, ||Gj — 0~1 ]^||cxd < e}, Aniin(4) — ^(e), 
and Amax( 4 ) — P(^)- We conclude that. 


|Gj ^j_]^||oo ^ p j 1 


IP [a = j] < ip [Tc = j\Te > j - 1] < P 

We prove in Lemma pl5| the exponential bound 

P ||Gj - 6»^"_\||oo > e|Amin(4-i) ^ 4(e) < Sp^exp (-min(l, £^(e)e^/16)Aj_i) . 
Hence 

P [te = oo] > 1 — exp (— min(l,f^(e)e^/16)A'j_i) . (46) 

i>i 

We will now show that there exists a random variable 'I'*(e) such that on {t^ = +oo}, 

Amax(4 ) ^ ^*(^) foi" ah j > 0. 
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We first note that on {t^ > k}, Oq, ... ,9^ £ ip), and 9j = Prox^ i^j-i ~ ~ ®)) 

for j = 1,... ,k. We then apply Lemma 14 with 9 = 9j-i, 9 = 9j and H = Sj, to write 


9j-9 


< 


9,-9 


^ + 27 ^(pa{ 9 j) - (/>a(0)| 




Oj-1-0 


9-9„i:,-9-^,). 


We multiply both sides by and uses the fact that 

to write 


'-{'^£>11 


-9 


^ ^ ^ ^2 ) ^{'^£>1-1} 


6 »,-i - 9 


-27lK>i-i}(0-0„5],-07_\). (47) 


Recall that 9j = Prox,,, {9j-i — ^{S — Sj); a), and split 9 — 9j as 

9-9j = 9- T^{9j_i-a) + r^(0j-i; a) - 9j, 


(48) 


where T^{9j-i]a) = (^9j-i —'y{S — 9j},)] . It is well known that the proxi¬ 

mal operator is non-expansive—see (Bauschke and Combettes 2011, Propositions 12.26 
and 12.27). Hence 

(r^( 0 ,_i; a) - 9j, S,- - < ||r.,( 0 ,-i; a) - 0 ,||f 


. - 


s,- - 


< 7 


s, - o-\ 


We then set Vj '=^ ( 6 * — T.y{9j-i; a), T^j — 9j_p,), and use the last inequality, (48), 


-1 


and (47) to deduce that 




-9 


7 


< I 1 - -W 1 1 

1p^ 


2 I ■'■{^■£>1-1} 


9,.i-9 


(49) 


Summing (49) for j = 1 to k yields 


l{r£>fe} 

k>0 


9k - 9 


< 


-9 


+ 27 sup 
F k>l 


00 

i=i 
2 


k 

i=i 

S, - 0-\ 


-9 


+ C) 


(50) 
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del 


where C = 27 sup;j>;^ 


V Er=l 


_ -1 
J ''j -1 


s. -0 


. The bound (50) in 


turn means that on the event {r^ = oo}, for all j > 0, 


ll^lb < ll^lb + ^ ^ < i^uB + \/p{'^uB - E(e))^ + C- 

Hence, with 'I'*(e) min (^tpl{e),'ipuB + y/pi^l^UB — + C) j we have shown that 

{xe = oo} C {r(£*(e),'I'*(e)) = ooj, and the first part of the lemma follows from the 


bound 

Bound on E(T*(e)^). Clearly it suffices to bound E(C). Recall that Sj = ^ 

where zi:Nj N(O,0“}^). We easily calculate (See Lemma 15 for details) that on the 
event {r, > j - 1}, 

E 


s, - e-\ 


•E-i)=^(Tr( 0 -_\)^ + 


,-i .2 


3-1 

b -1 


and for G Af+(4(e), V'}(e)), Tr(6<^. + 6-^ ^ < 4(e) ^(p + p^). Hence 


E 


OO 


2 

oo 




A l{r>/-l} 

_j = l 

S, - 

F 

= E®: 

i=i 

M 

H 

V 

1 

5^1 - 

p'A. 


< 4(e) ^{P + P'^)'^^ < 


j=i 1 


by assumption. By Doob’s inequality (Hall and Heyde (1980) Theorem 2.2) applied to 
the martingale {Ei=i ^fc}, 




k 


E 

sup 

Eu 



k>l 

i=i 



= lim E 

N^OO 



k 


sup 

Ee- 


l<A:<Af 

i=i 



< 2 lim E^/2 
N—^OO 


N 

Ey, 

i=i 


2-1 


1/2 


2 4/E(7 


1 = 1 


Using again the facts that the proximal operator is non-expansive and 9 = T.y(9]a), we 


have \Vj\ < l{r,>j-i} 
above, we have 


0,-1 - 0 




E(U/) = E [E(U/|J}-_i)] < 4(e)-2(p + p2)iVriE _i} 


Therefore, with similar calculations as 


0,-1 - 0 


On {r, > j - 1}, 


Oj-i - 0 


< “ 0\\2 < y/p{'ipUB - 4(e)). Hence 


E(y. 2 ) < + -4(e))^ 1 


4(e)^ 


E-’ 
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which together with the assumption N- < oo, and the above calculation show that 


E 


SUPfc>i 


E =1 E 


< oo. 


Convergence of On ■ We sum (49) from j = 1 to A;, which gives, for all A: > 1: 

k 


L{'re>fc} 


Ok - 0 






i=i 


0,-1-0 


< 27 sup 
k>l 


k 

El 

i=i 


i=i 


S. - 6 » 




We have seen above that the term on the right-hand side of this inequality has a finite 

2 


expectation. This implies the series Xlfci 1 


OO 

J =1 


9,_1 - 0 


is finite almost surely, 


which in turn implies that on {r^ = 00 }, we necessarily have lim^ 9^ = 0, as claimed. 


□ 


8.5. Proof of Theorem O 


Proof. Taking the expectation on both sides on (49) yields 
2 ^ 


E 


4'r>i} 


0,-0 


- • ^ ^2 ' 




0 ,- 1-0 


+ 27 ^E 




E - 


I - 7,-1 


Iterating this inequality yields 


E 




Ok-O 




-0 




j=l 


1_ 


k-j 


E 


l{^>j-i}E 


— 3-1 

, 7-1 


S,- 


Recall that Sj = 7 ^ where N(O,0^-7i)- easily calculate (See 

Lemma 15 for details) that on the event {r > j — 1}, 

E 


i.i.d. 


Si - 6-li 


I IL-i) = 71 + 


T-1 


and for 6j G AI+(^, V’), Tr(0 - -|- 


3-1 


E 




Ok-O 


then follows. 


< ^ ^{p + P). The stated bound on the term 


□ 
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8.6. Proof of Theorem [Q]. 

Proof. We write r = t{£, fj). On {r > A:}, Oq, ... ,9^ G A4+(£, if), and = Prox.y(0j — 

7 ( 5 ' — Si+i; a) for i > 0. We apply Lemma 14 with H = Sj+i to write 


^*+1 ^ F - V* 

By iterating this bound, we obtain 


<( 1 -^ 


e^-e 


< 1 - 




9,-9 


9o-9 


+ 27 ( 0j+i — 9, — 9, 


+27 sup 
k>0 


9k - 9 




i=i 


fL 


k-j 


Sj + l 




(51) 


On {r(f,iA) = 00 }, supj>o 


9,-9 


is finite and if lim 


s,+i - 9: 


-1 


= 0 , the bound 


(51) would easily imply that lim^ 


9k- 9 


= 0. Hence the theorem is proved by showing 


that on {r = 00 }, lim^. ||Sfc_|_i — = 0. From (25), we write 

5^fc+i - 9)f^ = (1 - Cfc+i) (Sfc - + (1 - Ck+i){9kli - 9^^) + Ck+iVk+i, 

where 


r]k+i ^ Y1 ^ ' N(0, 0;- ^). 


N 


i.i.d. 


N 


k=l 


We expand this into 

(Si+i - ') = (1 - (St - «7i) + 4L ++fliL. 

where the remainders are given by 


R^l, = -l{r=k}il-Ck+l)^k, 


R^kh =' (1 - Ck)l{r>k-l}9kl, - (1 - Ck+l)l{r>k}9k\ 


-1 


3-1 


R^kh = iCk - Ck+l)lir>k-l}9kll 

„(4) del . 

Pk+1 Sfc+1 l{r>fc}l/A:+l • 


and 


Since l{r>fc,r=oo} = 1 {t=oo}) and l{r=fc,T=oo} = 0) it follows that for all n > 0, 


l{r=oo} (5ifc+l -9k^) = 1{t=oo} “ Cfc+l)(Sl - 9q 


k=l 


?0) _L K>0) I p(4)' 


+iH^oo}EK+^f+^n n (i-cm). 

j=l i=j+l 

Clearly, we have nfc=i(t “ Cfc+i) < exp Cfc+i^ —)• 0 as A: —)> 00 by (24), and if 

the series X)j>i finite on {r = 00 }, then by Kronecker lemma, 
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it would follow that [Rf^ + Rf' ) nj^=,+i(l “ Ci+i) 


(3) 


?(4) 


0, as k 


oo on 


{r = oo}. Hence, it suffices to prove that the series 
{r = oo}. 


We have Ylk=i = (1 “ Ci)1{t>o}^o ^ “ (1 “ Cfc+i)l{T>fc}^fc assumption that 


3—1 


Rf + Rf + Rf 


-1 


is finite on 


9k has a limit and 9k G A4+{i,'ilj) easily implies that Ylk^k ^ finite. Similarly, we have 

Ea 


R 


(3) 


< i ^Co < OO, and 


E 




k 




1 ^ 

N ^ 

k=l 


Zkz'k - 


-1 


< I ‘^{p + V^)'^Cl < oo. 


□ 


8.7. Proof of Theorem IIOL 


Proof. The proof follows (Mazumder and Hastie 2012a) with appropriate modifications. 


and we provide a brief sketch here. 

First Part: 

We start with the connected component decomposition of the non-zeros of 9. Let us 
assume that the rows/columns of the matrix 9 have been arranged such that it is block 
diagonal. We proceed by writing the KKT conditions of Problem 


-9 + S + Ti sign(6') -h 2t29 = 0, 


(52) 


where, ri = aAi and T 2 = ^^^^2 and sign(0) is a matrix where sign(-) is applied compo¬ 
nentwise to every entry of 9. Since 9 is block diagonal so is 9~^. If we take the 


entry of the matrix appearing in (52) such that i and j belong to two different connected 
components then: —{9~^)ij + 2T9ij = 0 which implies that Sij + ti sign(0jj) = 0. Thus we 
have: < ti for all pairs i,j such that they belong to two different connected compo¬ 

nents. Thus the binary matrix ((l(|sjj| > ti))) will have zeros for all i,j belonging to two 
different components Vr and for r 7 ^ s. The connected components of ((l(|sjj| > ti))) 
have a finer resolution than Vj,j = 1,... ,J and in particular J < J. 

Second Part: 

For the other part, let us start by assuming that the symmetric binary matrix ((l(|sij| > 
Ti))) breaks down into J many connected components; and let 9 = diag(0i,... , 9j) be a 
block diagonal matrix, where, each 9^ is obtained by solving Problem ^ restricted to the 
rth connected component Vr where, r = 1,..., J. For any i,j belonging to two different 
components Vr and with r / s we have that < ri and in addition, 9ij = 0 and 
{9~^)ij = 0. This implies that 9 satisfies the KKT conditions (52) and is hence a solution 


to Problem ([^. This in particular, implies that J > J and the connected components of 
Vj,j = l,...,J are a finer resolution than Vr,r = 1,..., J. 

Combining the above two parts, we conclude that the connected components of the two 
binary matrices ((l(|sjj| > ri))) and {{l{\9ij / 0))) are indeed equal. 

□ 
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8.8. Proof of Lemma 1111 

Proof. To see this we take the derivative of the objective function wrt 0 and set it to zero: 

-0-i + 5 + A0 = O. (53) 

Suppose that the sample covariance matrix S can be written as: 

S = UDU', 

where the above denotes the full eigen-value decomposition of S which is a p x p matrix. 
Let di denote the diagonals of D. We will show that the solution to Problem Q is of 
the form 6 = Udiag{a)U', where, diag(iT) is a diagonal matrix with the ith diagonal entry 
being cjj. 

Let us multiply both sides of (53) by U' and U on the left and right respectively. It is 
then easy to see that the optimal values of a can be computed as follows: 

— l/cTj di Xci = 0 

for alH = 1,... ,p. The above can be solved for every i separately leading to: 

-di cCf -\- 4A 


(Ti = 


2A 


Vi 


Thus we have the statement of Lemma [TTl 


□ 
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Appendix 


Appendix A. Related Work and Algorithms 

In this section we review some of the state-of-the art methods and approaches for the 
Glasso problem (Problem ([l|))- Problem ([I|) is a nonlinear convex semidefinite opti¬ 
mization problem ( Vandenberghe and Boy^ 1996) and off-the-shelf interior point solvers 
typically have a per-iteration complexity of O(p^) that stems from solving a typically dense 
system with 0{p^) variables (Vandenberghe et al., 1998). This makes generic interior point 


solvers inapplicable for solving problems with p of the order of a few hundred. 

A popular approach to optimize problem ([^ is to focus on its dual optimization problem, 
given by: 


maximize 


logdet(ii;) subject to ||5 — in||oo < A, 


(54) 


with primal dual relationship given hy w = 9 ^. The dual problem has a smooth function 
appearing in its objective. Many efficient solvers for Problem ([I|) optimize the dual Prob 


Mazumder and Hastie (2012b) and references therein 


lem (54) — see for example Banerjee et al. (2008); Friedman et al. (2007a); Lu (2009); 


In one of the earlier works, Banerjee et al. (2008) consider solving the dual Problem (54) 


They propose a smooth accelerated gradient based method (Nesterov, 2005) with complex¬ 
ity to obtain a e-accurate solution — the per iteration cost being 0{p^). They also 

proposed a block coordinate method which requires solving at every iteration, a box con¬ 
strained quadratic program (QP) which they solve using Interior point methods—leading 
to an overall complexity of 0{p^). 


The graphical lasso algorithm (Friedman et al., 2007b) is widely regarded as one of the 
most efficient and practical algorithms for Problem (|^. The algorithm uses a row-by-row 
block coordinate method that requires to solve a ii regularized quadratic program for 
every row/column—the authors use one-at-a-time cyclical coordinate descent to solve the 
QPs to high accuracy. While it is difficult to provide a precise complexity result for this 
method, the cost is roughly 0{p^) for (reasonably) sparse-proble ms with p nodes. For 
dense problems the cost can be as large as O(p^), or even more. Mazumder and Hastie 


(2012b) further investigate the properties of the graphical lasso algorithm, its operational 
characteristics and propose another block coordinate method for Problem ([^ that often 
enjoys better numerical behavior than graphical lasso. 


The algorithm proposed in Lu (2009) employs accelerated gradient based algorithms 


(Nesterov, 2005, 2013). The algorithm SMACS proposed in the paper has a per iteration 


complexity of 0{p^) and an overall complexity of O(^) to reach a e-accurate optimal 
solution. 

Li and Toh (2010) propose a specialized interior point algorithm for problem (0- By 


rewriting the objective as a smooth convex optimization problem by doubling the nnmber 
of variables, the paper proposes a scheme to scale interior point like methods np to p = 
2000 . 


Scheinberg et al. (2010) propose alternating direction based methods for the problem. 


the main complexity per iteration being 0{p^) associated with a full spectral decompo¬ 
sition of a p X p symmetric matrix and a matrix inversion. Yuan (2012) propose an 
alternating direction method for problem ([^, with per iteration complexity of O(p^). 
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Computational scalability of a similar type can also be achieved by using the alternat¬ 


ing direction method of multipliers ADMM Boyd et al. (2011) which perform spectral 


decompositions and/or matrix inversions with per iteration complexity 0{p^). 

Fairly recently, Hsieh et al. (2014) propose a Newton-like method for Problem 0 , the 
algorithm is known as QuiC. The main idea is to reduce the problem to iteratively solv¬ 
ing large scale regularized quadratic programs, which are solved using one-at-a-time 
coordinate descent update rules. The authors develop asymptotic convergence guaran¬ 
tees of the algorithm. It appears that several computational tricks and fairly advanced 
implementations in C++ are used to make the approach scalable to large problems. At 
the time of writing this paper, QUIC seems to be one of the most advanced algorithms 
for Glasso. Oztoprak et al. (2012) propose a related approach based on a Newton-like 


quadratic approximation of the log-determinant function. 


Appendix B. Additional Computational Details 

We initialize all the solvers using the diagonal matrix obtained by taking the inverse 
sample variances. For all the simulated-data experiments, the step-size and the Monte 
Carlo batch-size are taken as follows. The step-size is set to 7 = 10, the Monte Carlo 
batch-size is set to = [30 -|- at iteration k. Additionally, for Algorithm we use 
N = 400, and Cfc = . 

For p = 1000, the values of the regularization parameters were taken as (a, A) = 
(0.89,0.01). We computed 9 (the target solution to the optimization problem) by running 
the deterministic algorithm for 1000 iterations. 

The size of the largest component is 967, one component had size two with all other 
components having size one. In this case, the splitting offered marginal improvements 
since the size of the maximal component was quite close to p. 

For p = 5000 the values of the regularization parameters were taken as (a. A) = 
(0.93,0.0085) and we computed 9 (the target solution to the optimization problem) by 
running the deterministic algorithm for 1000 iterations. 

For the case, p = 5, 000 splitting leads to 76 connected components. The size of the 
largest component is 4924 with all other components having size one. 

For p = 10, 000 , the values of the regularization parameters were taken as (a. A) = 
(0.96,0.01). We computed 9 (the target solution to the optimization problem) by running 
the deterministic algorithm for 500 iterations. 

For the case, p = 10, 000 splitting leads to 1330 connected components. The size of the 
largest component is 8670, one component has size two with all other components having 
size one. 

We present the results for the cases p = 5, 000 and p = 10, 000 in Table 

For the real-data example, the stochastic algorithms are set up as follows. The step-size 
is set to 7 = 5 X 10“®, the Monte Carlo batch-size is set to = [100 -|- at iteration 
k. Additionally, for Algorithm we use N = 200, and Ck = k~^'^‘^. 
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Appendix C. Some Technical Lemmas and Proofs 


Lemma 12. Consider the function f{0) = — logdet0 + Tr{9S), 6 G A1+. Take 0 < £ < 
< oo. If 9 £ if), and H £ M. are such that 9 + H £ fj), then 

fi9) + {S- 9-\H) + ^\\Hf <fi9 + H)< f{9) + {S - 9-\h) + 

Proof. First notice that i/^) is a convex set. Hence for all t £ [0,1], 0 + tH = 

(1 — t)9 + t{9 + H) £ M-+{£, Ip). Then by Taylor expansion we have, 

logdet(6* + Lf) = logdet 0 + if) + f (^{9 + tH)~^ — 9~^, H) dt. 


This gives 


f{9 + H)- f{9) -{S-9-\H) = - r {{9 + tH)-^ - 9-\ H) 

Jo 


dt. 


However {9 + tH) ^ — 9 ^ = —t9 ^H{9 + tH) Therefore, if 0 = Yl^j=i denotes 

the eigen-decomposition of 0 , we have 

-{{9 + tH)-^ -9-^,H) = tH{9-^H[9 + tH)-^H) 

p 

= tJ2^j^UjH{9 + tH)-^Huj 
i=i 
p 


< 


t 




i=i 


|2 — '' II ;t||2 

■ 


Similarly calculations gives 


The lemma is proved. 

We also use the following well known property of the proximal map. 
Lemma 13. For all 9, d £ A4, and for all a G [0,1], 7 > 0, 

gQ(Prox.y(0; a)) < gaiiJ) I — (i? — Prox.y(0; a), Prox.y(0; a) — 9). 

7 

Proof. See ( |Bauschke and Combettes , [2011 Propositions 12.26 and 12.27). 
Lemma 12 amd Lemma [T^ together give the following key result. 


□ 


□ 


Lemma 14. Fix 0 < £ < ^p < 00 , and 7 G (0,i^]. Suppose that 9,9 £ J^+{£,ip), and 


H £ Jv[ are such that 9 Prox.y(0 — 7(5 — H); a) £ Al+(i, i/>). Then 


9-9 

^ < 27 (j)a{9) - (pa{0)'j + 

9-9 

2 

F 




< 

(i-A) 

9-9 




V V’V 



+ 2j(9-9,H- 


n-l 


where we recall that (pa{d) = f{d) + ga{d)- 
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Proof. Set f{9) = — logdet0 + Tr(05), 9 G j\4+. By Lemma 12 


f{9) < f{9) + {S-9-\9-9) + ^\\9- 9\\l. 

Subtracting f{9) from both sides of the above inequality and re-arranging gives 
fi0)-fi9) < \f{9) + (s-9-\9-9)-f{9) 


(55) 


+ (^s-e-\ 0 -e) + P\\S- 0 \\l. 


(56) 


Since 9,9 G A4+{i,'tp), the strong convexity of 0 i—)• — logdet0 -|- Jr{9S) established in 

. 2 

— 9 . Using this in 


Lemma 


12 


(56) gives 


implies that f{9) + (^S — 9 ^,9 — 9'^ — f{9) < —^ 




By Lemma 13 


9-9 


1 


^ + {S-9-\9-9) + -\\9-9\ 


1 


(57) 


gai0)-gai0) < -{0-0,0-{0-i{S-H))), 
1 


= -{9-9,9-9) + {9-9,S-H). 


(58) 


We combine (57) and (58) and re-arrange to deduce that 


(l>a{0) - (t>a{9) < 


\ ^ (^9 - 9,29 - 9 - 9^ + (^9 - 9, H - 9-^ 


2 \7 

Since 4>ai0) > 4>a{0)-, we conclude that 


9-9 


2 1 
F 27 


9-9 


+ (9-9,H-9 


-1 


(59) 


9-9 


< 27 [(t)a{0) - 4>a{9)] -h 


9-9 




9-9 


+ 2-i{9 -9,H - 9 - 


as claimed. 


□ 


Lemma 15. Take £ > 0, and 9 G A4+{£). Let zi-,n *~' N{0,9 ^), and set Gn = 
Then 


E 


\Gn-9-^\\1 


< 


p + p 

Ni^ ’ 


and for any e > 0 such that £e < 4, 

P (IIGat — 6 *“^||oo > £) < 4p^exp (—min(l,£^e^/16)A^) . 
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Proof. 


E 



II 

M 


= -yF 

N ^ 

{ziZi)j^k ^j,k') 


j,k 

V i=l / 




^ E ^ +ii»- 

l,k 


- 1 || 2 \ ^ f fP 




For the exponential bound, we reduce the problem to an exponential bound for chi- 


squared di stributions, and apply the following corollary of Lemma 1 of Laurent and Mas- 


sart 


(2000). Let Wi-N the chi-square distribution with one degree of freedom. For 


any x G [ 0 , 1 ], 


N 


- 1 ) 


k=l 


> Ay/xN 


< 2 e 


-Nx 


(60) 


For 1 < i, j < p, arbitrary, set = Zk,iZkj, and aij = 0^^^. Suppose that i ^ j. It is 
easy to check that 


E[. 

k=l 


7(^1 — IT- ■ 


N 




k=l 


1 ^ 


k=l 


Notice that z^^i z^j ~ N(0, an + ajj + 2aij), and z^^i — Zkj ~ N(0, an + ajj — 2aij). It 
follows that for all x > 0 , 
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where IFi:Ar ~' xf, the chi-square distribution with one degree of freedom. The last 
inequality uses the fact that an + ajj -|- 2aij = u'9~^u < |||u|p < |, where u is the vector 
with 1 on components i and j and zero everywhere else (similarly for an + ajj — 2aij by 
putting —1 on the j-th entry). Then we apply (60) to obtain 
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When i = j, the bound 1 
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forward. The lemma follows from a standard union-sum argument. 
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